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Abstract 

Likelihood  based  regresssion  models,  such  as  the  normal  linear  regression  model  and  the  linear  logistic 
model,  assume  a  linear  (or  some  other  parametric)  form  for  the  covariate  effects.  We  introduce  the 
Local  Scoring  procedure  which  replaces  the  linear  form  £  X&  by  a  sum  of  smooth  functions  £  (X,  ). 

The  sy(  )’s  are  unspecified  functions  that  are  estimated  using  scatterplot  smoothers.  The  technique  is 
applicable  to  any  likelihood-based  regression  model:  the  class  of  Generalized  Linear  Models  contains 
many  of  these.  In  this  class  the  Local  Scoring  procedure  replaces  the  linear  predictor  r/  —  £  Xjpj  by 
the  additive  predictor  hence  the  name  Generalized  Additive  Models.  Local  Scoring  can  also 

be  applied  to  non-standard  models  like  Cox’s  proportional  hazards  model  for  survival  data. 

In  a  number  of  real  data  examples,  the  Local  Scoring  procedure  proves  to  be  useful  in  uncovering 
non-linear  covariate  effects.  It  has  the  advantage  of  being  completely  automatic,  i.e  no  “detective  work” 
is  needed  on  the  part  of  the  statistician. 

In  a  further  generalization,  the  technique  is  modified  to  estimate  the  form  of  the  link  function  for 
generalized  linear  models. 

The  Local  Scoring  procedure  is  shown  to  be  asymptotically  equivalent  to  Local  Likelihood  esti¬ 
mation,  another  technique  for  estimating  smooth  covariate  functions.  They  are  seen  to  produce  very 
similar  results  with  real  data,  with  Local  Scoring  being  considerably  faster. 

As  a  theoretical  underpinning,  we  view  Local  Scoring  and  Local  Likelihood  as  empirical  maximizers 

of  the  expected  log-hkehhood,  and  this  makes  clear  their  connection  to  standard  maximum  likelihood 
estimation. 

A  method  for  estimating  the  “degrees  of  freedom”  of  the  procedures  is  also  given. 
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1.  Introduction. 


Likelihood-based  regression  models  are  important  tools  in  data  analysis.  A  typical  scenario 
goes  as  follows.  A  likelihood  is  assumed  for  a  response  variable  Y,  and  the  mean  or  some 
other  parameter  is  modeled  as  a  linear  function  of  a  set  of  covariates  Xi,  X2, . . .  Xp.  The 
parameters  of  the  linear  function  are  then  estimated  by  maximum  likelihood.  Examples  of 
this  are  the  normal  linear  regression  model,  the  logistic  regression  model  for  binary  data,  and 
Cox’s  proportional  hazards  model  for  survival  data.  These  models  all  assume  a  linear  (or  some 
parametric)  form  for  the  covariates  effects. 

In  the  linear  regression  problem,  there  has  been  a  trend  in  the  past  few  years  to  move 
away  from  linear  functions  and  model  the  dependence  of  Y  on  XuX2,...Xp  in  a  more  non- 
parametric  fashion.  For  a  single  covariate,  such  a  model  would  be  Y  =  s(X)  +  error  where  s(A) 
is  a  unspecified  smooth  function.  This  function  can  be  estimated  by  any  so-called  scatterplot 
smoother,  for  example  a  running  mean,  running  median,  running  least  squares  line,  kernel 
estimate,  or  a  spline.  The  resulting  smooth  estimate  is  useful  both  as  a  descriptive  tool  and  as 
a  predictive  model.  For  the  p  covariates  X\,X2  . . .  Xp,  one  can  use  a  fc-dimensional  scatterplot 
smoother  to  estimate  s(.X’),  or  else  assume  a  less  general  model  such  as  s(.X')  =  Sj(Xj)  and 
estimate  it  in  a  forward  stepwise  manner. 

In  this  paper,  we  propose  a  technique  for  estimating  smooth  covariate  functions  in  any 
likelihood-based  regression  model.  We  call  it  the  Local  Scoring  algorithm.  This  procedure 
uses  scatterplot  smoothers  to  generalize  the  usual  Fisher  Scoring  procedure  for  computing 
maximum  likelihood  estimates.  For  example,  the  linear  logistic  model  for  binary  data  specifies 
l°gp/(l  ~  p)  =  P 0  +  P\Xi  +  •  •  •  PXP.  This  can  be  generalized  to  logp/(l  -  p)  =  Sj(Xj), 
and  the  Local  Scoring  procedure  provides  non-parametric,  smooth  estimates  of  the 

The  Gaussian  and  logistic  models  are  members  of  the  class  of  generalized,  linear  models 
(GLM’s)  (Nelder  and  Wedderburn,  1972).  This  comprehensive  class  restricts  Y  to  be  in  the 
exponential  family;  the  statistical  package  GLIM  (Generalized  Linear  Interactive  Modeling) 
performs  estimation  and  diagnostic  checking  of  these  models.  The  local  scoring  procedure 
generalizes  GLM’s  by  replacing  the  linear  predictor  p  =  by  an  additive  predictor  of 

the  form  p  =  J2  sj{Xj)-  The  third  example  mentioned  earlier,  the  proportional  hazards  model, 
is  not  in  the  exponential  family,  and  the  likelihood  it  uses  is  not  in  fact  a  true  likelihood  at  all. 
Nevertheless,  we  still  think  of  it  as  a  “likelihood-based”  regression  model,  and  the  local  scoring 
procedure  can  be  applied.  The  usual  form  for  the  relative  risk,  exp QuXjfij)  is  replaced  by 
the  more  general  form  expQT)  Sj(Xj)). 

The  local  scoring  procedure  is  asymptotically  equivalent  to  another  method  for  estimating 
smooth  covariate  functions,  Local  Likelihood  estimation  (Tibshirani,  1982,  Hastie,  1983a,  and 
Tibshirani,  1984).  In  this  paper  we  compare  the  two  techniques  in  some  examples  and  find 
that  the  estimated  functions  are  very  similar.  The  advantage  of  the  local  scoring  method  is 
that  it  is  considerably  faster. 
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As  a  further  generalization,  the  local  scoring  procedure  can  be  extended  to  provide  non- 
parametric  estimation  of  the  link  function.  This  facilitates,  for  example,  estimation  of  the 
model  f(p)  =  for  binary  data.  It  also  provides  a  check  of  two  of  the  assumptions 

inherent  in  linear  logistic  modeling:  the  linear  form  for  the  covariates  and  the  logit  link. 
The  complete  algorithm,  with  covariate  and  link  function  estimation,  can  be  viewed  as  a 
generalization  of  Friedman  and  Owen’s  PACE  model  to  non- Gaussian  data  (Friedman  and 
Owen,  1984). 

This  paper  is  non- technical  for  the  most  part,  with  an  emphasis  on  the  techniques  and 
their  illustration  through  examples.  In  Section  2,  we  review  the  linear  regression  model  and  its 
generalization  (additive  models  built  up  from  covariate  smooths).  Section  3  reviews  generalized 
linear  models.  In  Section  4,  we  link  smoothing  and  generalized  linear  models  to  produce  a  more 
general  model.  The  two  techniques  for  estimation  are  introduced  and  illustrated. 

In  Section  5,  we  present  a  unified  framework  in  which  to  view  the  estimation  procedures. 
Section  6  contains  examples  of  the  procedures,  including  the  logistic  model  and  Cox’s  model  for 
censored  data.  In  Section  7  we  discuss  multiple  covariate  models  and  backfitting  procedures. 

Section  8  compares  the  Local  Scoring  and  Local  Likelihood  procedures,  Section  9  deals 
with  their  asymptotic  properties,  and  in  Section  10  we  discuss  inference  for  the  models.  In 
Section  11  we  analyze  the  air  pollution  data  which  Breiman  and  Friedman  (1982)  analyzed 
using  their  ACE  model. 

Section  12  details  estimation  of  the  link  function  as  well  the  the  covariate  functions,  and 
shows  the  connection  to  the  PACE  model.  Finally,  in  Section  13,  we  discuss  the  relationship 
of  Generalized  Additive  Models  to  other  models  suggested  in  the  literature. 

2.  The  Linear  Regression  Model  and  its  Smooth  Extension. 

Our  discussion  will  center  on  a  response  random  variable  Y,  and  a  set  of  predictor  random 
variables  Xi,X2,-  ■  •  Xp.  A  set  of  n  independent  realizations  of  these  random  variables  will  be 
denoted  by  (t/i ,  xn, . . .  xip), . . .  (yn,  xni . . .  xnp).  When  working  with  a  single  predictor  (p  =  1), 
we’ll  denote  it  by  X  and  its  realizations  by  xi,x2,.  ..xn. 

A  regression  procedure  can  be  viewed  as  a  method  for  estimating  E  (Y  \  X\,  X2, . . .  Xp). 
The  standard  linear  regression  model  assumes  a  simple  form  for  this  conditional  expectation: 

E(^  \Xi,X2). .  .Xp)  =  /30  +  Xi/3i  +  ...Xpj3p.  (1) 

Given  a  sample,  estimates  of  (3q,  /3i,  . . .  f}p  are  usually  obtained  by  least  squares. 

The  additive  smooth  model  generalizes  the  linear  regression  model.  In  place  of  (l),  we 
assume 

E  (Y  |  Xu  X2  . . .  Xp)  =  so  +  £  «,•(*/)  (2) 

i=i 
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where  the  «,■(•)’ s  are  smooth  functions  standardized  so  that  Esj(X})  =  0.  These  functions  are 
estimated  one  at  a  time,  in  a  forward  stepwise  manner.  Estimation  of  each  sy(-)  is  achieved 
through  a  scatterplot  smoother. 

2.1.  Scatterplot  Smoothers. 

Let’s  look  first  at-  the  case  of  a  single  predictor.  Our  model  is 


E(r|x)  =  s(x)  (3) 

(If  there  is  only  one  smooth  function,  we  suppress  the  constant  term  so  and  absorb  it  into  the 
function).  To  estimate  s(x)  from  data,  we  can  use  any  reasonable  estimate  of  E(y  \X  =  x). 
One  class  of  estimates  are  the  local  average  estimates : 

l(xt)  =  Avey€Ar.[yy]  (4) 

where  “Ave”  represents  some  averaging  operator  like  the  mean  and  N{  is  a  neighbourhood  of  x{ 
(a  set  of  indices  of  points  whose  x  values  are  close  to  xt).  The  only  type  of  neighbourhoods  we’ll 
consider  in  this  paper  are  symmetric  nearest  neighbourhoods.  Associated  with  a  neighbourhood 
is  the  span  or  window  size  w,  this  is  the  proportion  of  the  total  points  contained  in  each 
neighbourhood.  Assuming  that  the  data  points  are  sorted  by  increasing  x  value,  a  span  w 
symmetric  neighbourhood  at  x,-  will  contain  [urn]  points;  half  to  the  left  of  x,  and  half  to  the 
right.  A  formal  definition  is: 


Ni  =  {max(t  — 


ttm 


-,1), ...,»  -  !,»,»  +  1, ...  min(t  + 


tim 


:,n)} 


(5) 


We  see  that  the  neighbourhoods  are  truncated  near  the  end  points  if  W  1  points  are  not 
available.  The  span  w  controls  the  smoothness  of  the  resulting  estimate,  and  must  be  chosen 
in  some  way  from  the  data. 


If  Ave  stands  for  arithmetic  mean,  then  s(')  is  the  running  mean ,  the  simplest  possible 
scatterplot  smoother.  The  running  mean  is  not  a  satisfactory  smoother  because  it  creates 
large  biasses  at  the  endpoints  and  doesn’t  reproduce  straight  lines  (i.e.  if  the  data  lie  exactly 
along  a  straight  line,  the  smooth  of  the  data  will  not  be  a  straight  line).  A  slight  refinement  of 
running  average,  the  running  lines  smoother  alleviates  these  problems.  The  running  lines 
estimate  is  defined  by 

«(*.)  =  A*  +  $u*i  (6) 

A  A 

where  /?o %  and  /3lt-  axe  the  least  squares  estimates  for  the  data  points  in  A,-: 

o  .  _  ~  )yj 

jeN{(x j  ~  xi)2  (7) 

A  A 

Poi  =  Vi  ~ 
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and  Xi  —  *  YljeNi  xh  Vi  —  k  ^jeNi  Vi- 

The  running  lines  smoother  is  the  most  obvious  generalization  of  the  least  squares  line. 
When  every  neighbourhood  contains  100%  of  the  data  points,  the  smooth  agrees  exactly  with 
the  least  squares  line.  Although  very  simple  in  nature,  the  running  lines  smoother  produces 
reasonable  results  and  has  the  advantage  that  the  estimate  in  a  neighbourhood  can  be  found 
by  updating  the  estimate  of  the  previous  neighbourhood.  As  a  result,  a  running  lines  smoother 
can  be  implemented  in  an  O(n)  algorithm,  a  fact  that  will  become  important  when  we  use  it 
as  a  primitive  in  other  procedures.  For  the  rest  of  this  paper,  a  “  Smooth  (•)”  operation  will 
refer  to  a  running  lines  smoother  for  some  fixed  span. 

It  is  important  to  note,  however,  that  the  running  lines  smooth  plays  no  special  role  in  the 
algorithms  that  are  described  in  this  paper.  Other  estimates  of  E(y  |A)  could  be  used,  such 
as  a  kernel  or  spline  smoother.  Except  for  the  increased  computational  cost,  these  smoothers 
could  be  expected  to  work  as  well  or  better  than  the  running  lines  smooth. 

Finally,  using  Smooth  as  a  building  block,  the  full  model  (2)  can  be  estimated  in  a 
forward  stepwise  manner.  This  is  discussed  in  Section  7. 

2.2.  Span  Selection  and  the  Bias-Variance  Tradeoff. 

The  running  lines  smoother  requires  a  choice  of  span  size  w.  Let’s  look  at  the  extreme  choices 
first.  When  w  —  0,  s(x,)  is  just  y,-.  This  is  not  a  good  estimate  because  it  has  a  high  variance 
and  is  not  smooth.  If  w  —  2.0  (that  is  every  neighbourhood  contains  all  the  data  points),  * 
s(-)  is  the  global  least  squares  regression  line.  This  estimate  is  too  smooth  and  will  not  pick 
up  curvature  in  the  underlying  function,  i.e.  will  be  biassed.  Hence  the  span  size  should  be 
chosen  between  0  and  2  to  trade-off  the  bias  and  variability  of  the  estimate. 

A  data-based  criterion  can  be  derived  for  this  purpose  if  we  consider  the  estimates  of 
E  (Y  |  A)  as  empirical  minimizers  of  the  (integrated)  prediction  squared  error 

PSE  =  E(y-5(X))2 

or  equivalently  the  integrated  mean  squared  error 

MSE=  E(E {Y  |X)-s(X))2. 

Let  Sw*(xi)  be  the  running  lines  smooth  of  span  rv,  at  x,-,  having  removed  (x,-,y,)  from  the 
sample.  Then  the  cross-validation  sum  of  squares  is  defined  by  CVSS(w )  =  (l/n)X)i(y«  ~ 
^’(*»))2-  0ne  can  show  that  E  (CV55(tw))  ttt  PSE,  using  the  fact  that  «*,(*,-)  is  independent 
of  y,.  Thus  it  is  reasonable  to  choose  the  span  w  that  produces  the  smallest  value  of  CVSS(u;). 
This  criterion  effectively  weighs  bias  and  variance  based  on  the  sample.  Cross-validation  for 
span  selection  is  discussed  in  Friedman  and  Stuetzle  (1982).  Note  that  if  we  used  the  observed 

*  A  neighbourhood  with  span  1.0  would  only  contain  half  the  data  at  the  endpoints 


4 


residual  error  RSS  —  D"(y,-  ~  sw(ii))2  to  choose  tv,  (»«,(*,•)  being  the  fit  at  x ,•  with  span  w) 
we  would  get  w  =  0  and  hence  s(xt)  =  y,.  Not  surprisingly,  E  (RSS)  ±  PSE.  The  point  is 
that  by  choosing  the  span  to  minimize  an  estimate  of  expected  squared  error,  we  get  a  sensible 
estimate. 

3.  A  Review  of  Generalized  Linear  Models  (GLMs). 

Generalized  linear  models  (Nelder  and  Wedderburn,  1972)  consist  of  a  random  component,  a 
systematic  component,  and  a  link  function,  linking  the  two  components.  The  response  Y  is 
assumed  to  have  exponential  family  density 

fy(y;  0;  <t>)  =  exp{[yfl  -  h{0)\/a{<t>)  +  c(y,  ^)}  (8) 

where  9  is  the  natural  parameter,  and  <f>  is  the  scale  parameter.  This  is  the  random  component 
of  the  model.  We  also  assume  that  the  expectation  of  Y,  denoted  by  >,  is  related  to  the  set 
of  covariates  X%,X2  . .  .Xp  by  g(fi)  =  rj  where  r)  =  +  PiX\  . . .  f3pXp.  rj  is  the  systematic 

component  and  g  is  the  link  function.  Note  that  the  mean  /z  is  related  to  the  natural  parameter 
9  by  M  =  b'(9)\  also,  the  most  commonly  used  link  for  a  given  /  is  called  the  canonical  link, 
for  which  rj  -  6.  It  is  customary,  however,  to  define  the  model  in  terms  of  //  and  r)  =  g(p)  and 
thus  9  does  not  play  a  role.  Hence,  when  convenient  we’ll  write  /v(y,^,^)  as  fY{y,p.,(f>). 

Estimation  of  p,  doesn’t  involve  the  scale  parameter  <f>,  so  for  simplicity  this  will  be  assumed 
known. 

Given  specific  choices  for  the  random  and  systematic  components,  a  link  function,  and 
a  set  of  n  observations,  the  maximum  likelihood  estimate  of  =  (/?0,/?i  . .  .Pp)  can  be  found 
by  a  Fisher  scoring  procedure.  GLIM  uses  an  equivalent  algorithm  called  adjusted  dependent 
variable  regression  .  Given  f?° ,  (a  current  estimate  of  the  linear  predictor),  with  corresponding 
fitted  value  /i°,  we  form  the  “adjusted  dependent  variable” 

=  «°  +  (»  -  M°)(^)o  (9) 


Define  weights  W°  by 

(W0)-1  =  (10) 

where  V°  is  the  variance  of  Y  at  fj,  =  p.0.  The  algorithm  proceeds  by  regressing  z°  on  l,  xi, ...  xp 
with  weights  W°  to  obtain  an  estimate  /?.  Using  a  new  fi  and  fj  are  computed.  A  new  z  is 
computed  and  the  process  is  repeated,  until  the  changes  in  /?  are  sufficiently  small.  Nelder  and 
Wedderburn  show  that  the  adjusted  dependent  variable  algorithm  is  equivalent  to  the  Fisher 
Scoring  procedure.  It  is  attractive  because  no  special  optimization  software  is  required,  just  a 
subroutine  that  computes  weighted  least  squares  estimates. 
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Deviations  between  the  data  and  the  estimated  model  axe  measured  using  the  Deviance, 
defined  by 

Dev(y,/i)  =  2[/(y)  -  /(£)]  (11) 

where  /(/i)  =  23 l°g  /V(y»,  tx,-,  4>)  is  the  log  likelihood.  In  the  case  of  Gaussian  errors  the 
deviance  is  identical  to  residual  sum  of  squares  (RSS)  and  in  the  more  general  case  it  enjoys 
the  Pythagorean  properties  of  the  RSS. 

A  comprehensive  description  of  generalized  linear  models  is  given  by  McCullagh  and 
Nelder  (1983). 

4.  Smooth  Extensions  of  Generalized  Linear  Models. 

4.1.  Specification  of  the  Model. 

The  linear  predictor  r)  =  /30  +  X1f31 . . .  Xp/3p  specifies  that  XuX2,...Xp  act  in  a  linear  fashion. 
A  more  general  model  is 

p 

V  =  so  +  J28}(xi)  (12) 

l 

where  Si(-)  . . .  sp(-)  are  smooth  functions.  These  functions  will  not  be  given  a  parametric  form 
but  instead  will  be  estimated  in  a  non-parametric  fashion. 

4.2.  Estimation  of  the  model  —  Local  Scoring. 

We  require  an  estimate  of  the  sy(-)’s  in  (12).  For  the  linear  model  r)  =  /30  +  Xrfi  +  ...  Xp(3p, 
the  estimates  were  found  by  repeatedly  regressing  the  adjusted  dependent  variable  z  on 
1,  Ai, . . .  Xp.  Since  smoothing  generalizes  linear  regression,  in  the  smooth  model  r)  =  s(X),  we 
can  estimate  s(-)  by  smoothing  the  adjusting  dependent  variable  on  X.  We  call  this  procedure 
Local  Scoring  because  the  Fisher  scoring  update  is  computed  using  a  local  estimate  of  the 
score.  This  sensible  idea  can  be  justified  on  firm  grounds  (see  Section  5);  we  display  in  Figure 
1  (Section  6)  the  results  of  local  scoring  smoothing,  exp(s(x))/(l  +  exp(s(x))),  along  with  the 
usual  linear  estimate,  exp(a  +  /3x)/(l  +  exp(&  +  /?x)),  for  some  binary  response  data.  This  is 
one  of  the  smooths  from  the  analysis  of  Haberman’s  breast  cancer  data,  discussed  in  detail  in 
Sections  6  and  7. 

The  span  at  each  iteration  is  found  by  cross-validation,  as  described  in  Section  2.  Recall 
that  E'(C'VS'S(u;))  «  PSE  for  a  scatterplot  smoother;  the  derivation  of  this  rests  on  the  fact 
that  the  fitted  value  for  y,-  does  not  involve  y,-,  and  is  thus  independent  of  y,-.  In  this  setting,  the 
response  is  the  adjusted  dependent  variable  z,  which  is  a  function  of  y,-.  The  cross- validated 
fit  for  Zf  is  a  function  of  Zj,  j  ^  Since  Zj  is  a  function  of  y,-  from  previous  iterations,  z,- 
is  not  independent  of  its  cross-validated  fit.  However,  if  kn  is  the  number  of  points  in  the 
neighbourhood,  then  one  can  show  that  under  reasonable  conditions  the  dependence  is  only 
0(l/kn). 
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To  obtain  smoother  estimates,  we  use  a  slight  modification  of  this  criterion.  We  choose  a 
larger  span  than  the  cross- validatory  choice  if  it  produces  less  than  a  1%  increase  in  CVSS(w). 

For  the  full  model  (12)  ,  the  smooths  can  be  estimated  one  at  a  time  in  an  iterative 
fashion.  This  idea  is  discussed  in  detail  in  Section  7. 

4.3.  Estimation  of  the  model  —  Local  Likelihood  Estimation. 

Tibshirani  (1982),  Hastie  (1983a)  and  Tibshirani  (1984)  discuss  another  method  for  estimating 
smooth  covariate  functions,  called  Local  Likelihood  estimation.  For  a  single  covariate,  the  usual 
(linear)  procedure  fits  a  line  across  the  entire  range  of  X ,  i.e.  r)  =  p0  +  foX.  To  estimate  the 
model  r)  =  s(X),  the  local  likelihood  procedure  generalizes  this  by  assuming  that  locally  s(x) 
is  linear,  and  fits  a  line  in  a  neighbourhood  around  each  X  value.  In  the  exponential  family 
with  canonical  link,  the  local  likelihood  estimate  of  s(xt-)  is  defined  as 

»(*«)  =  Poi  +  PuXi  (13) 

where  /90»  and  /3lt-  maximize  the  local  likelihood: 

log  L{  =  ^2  {[yjOij  ~  h{6i))}/a{<t>)  +  c(yy,  <f>)}  (14) 

ieN, 

and  $ij  =  f3Qi  +  /3uXj.  The  local  likelihood  smooth  applied  to  the  Haberman  data  is  also  shown 
in  Figure  1,  along  with  the  local  scoring  smooth.  They  are  very  similar,  a  fact  that  seems  to 

be  a  general  phenomenen.  We  discuss  the  relationship  between  the  two  procedures  in  Section 

8. 

Local  scoring  and  local  likelihood  estimation  provide  two  methods  for  estimating  the 
covariate  functions  of  a  generalized  linear  model.  In  the  next  section,  we  introduce  a  theoretical 
framework  in  which  to  view  both  of  these  techniques.  Besides  providing  a  justification  for  the 
methods,  this  framework  also  produces  a  general  form  of  local  scoring  that  can  be  used  in  any 
likelihood-based  regression  model. 

5.  Justification  of  the  Smoothing  Procedures. 

5.1.  The  Expected  Log-likelihood  Criterion. 

In  Section  2  we  discussed  scatterplot  smoothers  as  estimates  of  E(F  \X  =  x).  There  we  saw 
that  by  choosing  the  span  to  minimize  an  estimate  of  expected  squared  error,  (as  opposed  to 
residual  sum  of  squares)  we  obtained  a  sensible  estimate.  In  this  section,  we  will  use  this  idea 
in  a  likelihood  setting,  basing  the  estimation  procedures  on  expected  log  likelihood. 

Consider  a  likelihood  based  regression  model  with  one  covariate.  We  assume  that  the  data 
pairs  (*i,yi),. . .  ( xn,yn )  are  independent  realizations  of  random  variables  X  and  Y.  Assume 
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also  that  given  X  =  x,Y  has  density 

Y  \X  =  x  ~  h(y,r))  (15) 

Since  rj  is  a  function  of  x,  we  will  sometimes  write  t?(x)  for  emphasis.  Denote  the  corresponding 
log-likelihood  for  a  single  observation  by  l(ri,Y),  or  l  for  short.  Now  to  estimate  »)(•),  we  could 
simply  maximize  ]Ci  y»)  over  r?(I2),  •  ■  •  »?(xn)-  This  is  unsatisfactory,  however, 

because  it  doesn’t  force  the  estimate  to  be  smooth.  In  the  logistic  model,  for  example,  it 
produces  rj(x{)  =  +oo  if  y,-  =  1  and  — oo  if  y ,•  =  0,  and  the  estimated  probabilities  are  just  the 
observed  y,’s.  Looking  back  at  the  scatterplot  smoothing  discussion,  we  see  that  the  remedy 
in  the  random  variable  case  is  to  choose  »)(•)  to  maximize  the  expected  log-likelihood: 

<K0  =  max"1!  E(/fa(X),Y))]  (16) 

the  expectation  being  over  the  joint  distribution  of  X  and  Y .  Clearly,  this  is  the  direct 
generalization  of  mean  squared  error  minimization  in  the  Gaussian  case.  Mean-squared  error 
generalizes  to  (expected)  Kullback-Leibler  distance  in  non-Gaussian  models,  and  maximization 
of  the  expected  log-likelihood  is  equivalent  to  minimization  of  the  Kullback-Leibler  distance. 

The  use  of  expected  log  likelihood  has  also  been  suggested  by  Brillinger  (1979)  and  Owen 
(1983,  unpublished  manuscript).  Note  that  for  each  X  =  x,  this  theoretical  criterion  is  max¬ 
imized  by  the  true  value  r?(x);  this  is  a  key  fact  in  deriving  the  asymptotic  properties  of 
maximum  likelihood  estimates.  In  what  follows,  we  show  that  standard  maximum  likelihood 
estimation  for  generalized  linear  models,  local  scoring,  and  local  likelihood  estimation  can  all 
be  viewed  as  methods  for  empirically  maximizing  the  expected  log  likelihood. 

5.2.  Derivation  of  the  Estimation  Techniques  via  Expected  Log-Likelihood. 


One  way  to  use  (16)  for  estimation  of  r?(x)  would  be  to  assume  a  simple  form  for  rj{x),  like 
*?(*)  A>  +  The  expectation  in  (16)  could  then  be  replaced  by  its  sample  analogue, 

and  the  resultant  expression  maximized  over  0O  and  ft.  This  is  nothing  more  than  standard 
maximum  likelihood  estimation. 

Now  suppose  (as  is  the  point  of  this  paper)  that  we  don’t  want  to  assume  a  parametric 
form  for  rj{x).  Differentiating  (16)  with  respect  to  ij,  we  get 


dl 


E^dr)  0 


(17) 


Given  some  initial  estimate  r^^(x),  a  first  order  Taylor  series  expansion  gives  the  improved 
estimate 


v\x)  =  r?°(x)  - 


lx) 


(18) 
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or 


r?x(x)  =  E 


1°(*)  - 


dl 

drj° 


E  f-^4- 


(19) 


This  provides  a  recipe  for  estimating  r?(-)  in  practice.  Starting  with  some  initial  estimate 
f?  (x),  a  new  estimate  is  obtained  using  formula  (19),  replacing  the  conditional  expectations 
by  scatterplot  smooths.  The  data  algorithm  analogue  is  thus 


n\x)  = 


Smooth 


(20) 


Since  the  variance  of  each  of  the  terms  in  the  brackets  is  approximately  oc  E  ( j^4),  the  smooth 

should  use  weights  a  Smooth (0)~x  for  efficient  estimation.  The  data  algorithm  consists  of 
repeated  iterations  of  (20),  until  convergence. 

In  the  generalized  linear  model  case,  we  can  simplify  (19)  before  replacing  E(-  |x)  by 
Smooth.  We  compute  *  =  (y  —  $  =  (y  -  ^[V1^]  -  (g)V"1  ,  and 

E(sp‘  I1)  —  ~(^)2^_1-  Hence  the  update  simplifies  to 


v\x)  =  E 


,oW  +  (V-/x°)0 


(21) 


The  data  analogue  is 

*?1(x)  =  Smooth  [r7°(x)  -{-  (y  — 


(22) 


with  weights  {j^)2V  1.  This  is  exactly  a  smooth  of  the  adjusted  dependent  variable,  suggested 
on  intuitive  grounds  in  Section  4. 


Note  that  we  chose  the  form  (19)  instead  of  (18).  In  the  case  of  distributions,  they  are 
the  same  because  conditional  expectation  is  a  projection  operator.  Most  smooths  are  not 
projections  and  thus  the  two  forms  are  not  equivalent  in  the  data  case.  We  chose  (19)  because 
in  the  Gaussian  case  it  simplifies  to  ^(x)  =  Smooth  (y)  without  any  iterations,  whereas  (18) 
would  require  iterations  even  in  this  simple  case. 

The  local  likelihood  procedure  can  also  be  viewed  as  an  empirical  method  of  maximizing 
EI(»?P0,Y).  Instead  of  differentiating  this  expression  (as  above),  we  write  El(r)(X),Y)  = 
E(  E(l(r](X),Y)  \X  =  x)).  Hence  it  is  sufficient  to  maximize  E  (/(77(A),  Y)  \X  =  x)  for  each 
x.  The  corresponding  data  recipe  can  be  derived  as  follows.  Consider  estimating  t](x)  at  some 
point  x  =  x,\  An  estimate  of  E  (/(77(A),  Y)  \X  =  xt)  is 


*mX),Y)  \X  =  Xi)  =  (1/A)  £  l(rj(xj),yj)  (23) 

ieNi 
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where  k  =  #  of  data  points  in  Nt.  Assuming  q(x)  »  pK  +  fax  for  points  in  Nit  (23)  is  then 
maximized  over  fa  and  fa.  The  resulting  estimate,  q(Xi)  =  fa  +  faX{,  is  the  local  likelihood 
estimate  as  defined  in  Section  4. 

The  algorithms  described  here  can  be  used  in  any  likelihood-based  regression  model. 
As  a  technical  point,  note  that  in  the  exponential  family,  we  linked  the  additive  predictor 
*7  =  ]Ci  sj{Xj)  to  the  distribution  of  Y  via  q  =  g(n).  In  some  non-exponential  family  models, 
A*  is  a  complicated  function  of  the  model  parameters  or  may  not  exist  at  all.,  It  would  then 
be  desirable  to  link  q  to  some  other  parameter  of  the  distribution.  This  is  true  in  the  Cox 
model  (see  the  next  section).  In  any  case,  there  is  no  difficulty  —  however  q  is  linked  to 
the  distribution  of  Y ,  the  likelihood  is  some  function  of  q  and  its  derivatives  are  used  in  the 
updating  formula. 

To  summarize  so  far,  maximization  of  the  expected  log-likelihood  has  led  to  a  general 
technique  for  estimating  a  smooth  covariate  function:  the  local  scoring  procedure.  In  the  case 
of  the  exponential  family  likelihood  this  procedure  corresponds  to  smoothing  of  the  adjusted 
dependent  variable.  Standard  (linear)  maximum  likelihood  estimation  and  local  likelihood 
estimation  can  also  be  viewed  as  as  empirical  maximizers  of  expected  log-likelihood.  Equiva¬ 
lently  they  can  all  be  viewed  as  empirical  minimizers  of  the  expected  Kullback-Leibler  distance 
between  the  model  and  the  estimate. 

We  have  not  addressed  the  problem  of  multiple  covariates  —  this  will  be  done  in  Section 

7. 

5.3.  Span  selection. 

In  the  Gaussian  or  ordinary  additive  regression  models  we  use  the  CVSS  to  guide  us  in  selecting 
spans.  CVSS  is  approximately  unbiassed  for  the  Expected  Prediction  Squared  Error  ( PSE ), 
whereas  the  RSS  is  not  and  would  lead  us  to  pick  spans  of  0.  In  the  exponential  family,  the 
Deviance  is  the  analogue  of  RSS.  It  is  a  sample  estimate  of  the  expected  Kullback-Leibler 
distance  between  a  model  and  future  observations.  Just  like  the  RSS  it  will  be  biassed  for 
this  quantity.  For  span  selection,  one  can  think  of  cross-validating  the  deviance  in  order  to 
get  an  approximately  unbiassed  estimate  for  the  Kullback-Leibler  distance.  This,  however,  is 
very  expensive  due  to  the  non-linear  nature  of  the  estimation  procedures.  In  ordinary  additive 
regression,  simple  deletion  formulae  allow  one  to  compute  cross  validated  fits  in  linear  time. 
In  the  case  of  generalized  additive  models,  however,  the  entire  estimation  procedure  has  to  be 
repeated  n  times,  and  so  cross-validation  is  infeasible. 

Instead  we  use  cross-validation  to  select  the  span  each  time  we  compute  a  smooth,  as 
outlined  in  Section  4.2.  This  is  done  in  linear  time,  and  since  squared  error  is  a  first  order 
approximation  to  the  deviance,  it  can  be  thought  of  as  an  approximation  to  the  cross- validated 
deviance. 
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6.  Some  Examples. 

6.1.  The  Gaussian  Model. 

For  this  model,  rj  =  p,  so  (22)  simplifies  to  ^(x)  =  Smooth  [y],  and  the  local  scoring  algorithm 
reduces  to  a  running  lines  smooth  of  y  on  x. 

The  local  likelihood  procedure  also  gives  the  running  lines  smooth  of  y  on  x,  since  the 
local  m.l.e  is  r)(x,)  =  fio »  +  fin ,  fioi  and  fin  being  the  least  squares  estimates  for  the  points 
in  N{.  The  Gaussian  model  is  applied  to  a  large  meteorological  data  set  in  Section  11. 

6.2.  The  Logistic  Model. 

A  binomial  response  model  assumes  that  the  proportion  of  successes  Y  is  such  that  n(x)Y  \x  ~ 
Bi n(n(x),p(x)),  where  Bin(n(x),p(x))  refers  to  the  binomial  distribution  with  parameters 
n(x)  and  p(x).  Often  the  data  is  binary  in  which  case  n(x)  =  1.  The  binomial  distribution 
is  a  member  of  the  exponential  family  with  canonical  link  #(p(x))  =  log  =  ^(x).  In 

the  linear  logistic  model  we  assume  r?(x)  =  fi0  +  filX,  and  the  parameters  are  estimated  by 
maximum  likelihood  using  Fisher’s  scoring  or  equivalently  by  using  adjusted  dependent  variable 
regression.  The  smooth  extension  of  this  model  generalizes  the  link  relation  to  log  p^x)  = 

.  o  1—  p(z) 

r?(x).  The  local  scoring  step  is 

=  Smooth [rj°(i)  +  ~ (24) 

with  weights  n(x)p°(x)(l  —  p°(x)).  We  now  demonstrate  the  procedure  on  some  real  data. 

A  study  conducted  between  1958  and  1970  at  the  University  of  Chicago’s  Billings  Hospital 
concerned  the  survival  of  patients  who  had  undergone  surgery  for  breast  cancer  (Haberman, 
1976).  There  are  306  observations  on  four  variables. 

_  f  1  if  patient  i  survived  5  years  or  longer; 

1 0  otherwise. 

®*i  =  age  of  patient  i  at  time  of  operation 
xi'2  =  year  of  operation  t  (minus  1900) 

x,3  =  number  of  positive  axillary  nodes  detected  in  patient  i 

Figure  1  shows  the  response  variable  plotted  against  the  covariate  age.  The  solid  non-linear 
function  was  estimated  using  the  local  scoring  method.  For  a  single  covariate  one  could  simply 
average  the  0-1  response  directly,  with  iterative  weights  [p(xt)(l  -  p(*<))]-1.  This  estimates 
E(F  |x)  =  p(x),  and  is  the  dashed  curve  in  the  figure.  It  is  identical  to  the  function  found 
using  the  local  likelihood  method  fitting  local  constants  to  the  logits.  The  local  likelihood 
smooth  fitting  local  straight  lines  (the  more  usual  approach)  is  the  dotted  curve.  They  are  all 
very  similar,  with  the  main  differences  occuring  in  the  tails  where  bias  effects  play  a  role.  We 
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P  (survived) 


Figure  1.  Survival  of  patients  who  underwent  surgery  versus  age 
of  the  patient.  The  local  scoring  function  is  the  solid  curve,  the  local 
likelihood  function  is  dotted,  the  running  mean  of  the  y’s  is  dashed, 
and  the  linear  logistic  function  is  the  almost  straight  curve.  The  size 
of  the  points  reflects  the  number  of  observations. 

will  see  in  the  Section  7  that  in  fitting  multiple  covariate  models,  this  approach  of  smoothing 
the  response  variable  directly  breaks  down,  whereas  the  local  scoring  and  local  likelihood 
techniques  generalize  easily.  We  will  pursue  this  example  in  Section  7. 

6.3.  The  Cox  Model. 

The  proportional  hazards  model  of  Cox  (1972)  is  an  example  of  a  non-exponential  family 
regression  model.  This  model  is  used  to  relate  a  covariate  to  a  possibly  censored  survival 
time.  The  data  available  are  of  the  form  (yi,®i,£i,) . . .  (t /n,xn,6n),  the  survival  time  y,-  being 
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complete  if  Si  =  1  and  censored  if  S{  =  0.  We  assume  there  are  no  ties  in  the  survival  times. 
The  proportional  hazards  model  assumes  the  hazard  relation 


\x)  =  \0{tyx 


(25) 


The  parameter  @  can  be  estimated  without  specification  of  A0(f)  by  choosing  /?  to  maximize 
the  partial  likelihood 


«-n 

i€D 


(26) 


Figure  2.  The  Stanford  Heart  Transplant  data.  The  solid  curve 
is  the  local  scoring  function,  the  dashed  line  is  the  local  likelihood 
function,  and  the  dotted  line  is  the  parametric  model. 


In  the  above,  D  is  the  set  of  indices  of  the  failures  and  R<  is  the  risk  set  just  before  the 
failure  at  y,-. 

A  more  general  model  is 

X[t  |i)  =  Xo(t)en^  (27) 

where  f?(i)  is  a  smooth  function  of  x.  One  way  to  estimate  would  be  to  apply  the  local 
scoring  formula  (20).  Letting  l  equal  the  log-partial  likelihood  and  C,-  =  {k  :  i  e  Rk},  (the  risk 
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sets  containing  individual  »),  straighforward  calculations  yield 


dl 


dr)(Xi) 


t  =  *  -  E 


*6C,  ^j€Rk 


and 


d2l 


dr}{xi) 


,\2 


=  -M*<) 


E 


keCi  tec,- 

$x,  smooths  are  applied  to  these  quantities,  as  in  (20),  and  the 


_ +  e2r?(z*0  V  1 


(28) 


(29) 


Starting  with  say  f;(i) 
process  is  iterated. 


The  local  likelihood  technique  can  also  be  applied  to  the  Cox  model  —  this  is  described 
in  Tibshirani  (1984).  We  won’t  give  details  here.  Instead,  we’ll  illustrate  the  two  estimation 
techniques  with  a  real  data  example. 

Miller  and  Halpern  (1983)  provide  a  number  of  analyses  of  the  Stanford  heart  transplant 
data.  The  data  consist  of  time  to  failure  (months)  and  two  covariates,  age  (years)  and  T5 
mismatch  score.  Here  we  will  consider  only  the  age  variable. 

Figure  2  shows  the  smooth  obtained  by  local  scoring  (solid  line)  and  local  likelihood 
(broken  line).  Also  shown  is  the  fit  obtained  by  inserting  a  linear  and  quadratic  term  for  age 
(dotted  line).  The  smooths  are  very  similar  and  both  show  a  marked  non-linear  effect.  Table 
1  summarizes  the  results  of  the  smooth  procedures  as  well  as  a  standard  (linear)  Cox  model. 


Table  1.  Stanford  Heart  Transplant  Data 


Analysis  of  Age 


Model 

-2  Log  Likelihood 

dof 

Null 

902.40 

0 

Linear 

894.82 

1 

Linear  +  Quadratic 

886.24 

2 

Local  Likelihood  (span  .5) 

884.65 

2.95 

Local  Scoring  (span  .5) 

884.66 

2.95 

The  column  labelled  “dof”  means  degrees  of  freedom  —  this  is  explained  in  section  10. 
The  smooths  suggest  that  the  log  relative  risk  stays  about  constant  up  to  age  45,  then  rises 
sharply.  The  quadratic  model  forces  a  parametric  shape  on  the  function,  and  misleadingly 
suggests  that  the  relative  risk  drops  then  rises.  The  data  set  is  analysed  more  thoroughly  in 
Tibshirani  (1984). 
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7.  Multiple  Covariates. 


When  we  have  p  covariates,  represented  by  the  vector  X  =  (XUX2,. .  .Xp),  a  general  model 
specifies  E  (Y  |  X  =  x)  =  p  and  g(p)  =  r)(x)  where  r)  is  a  function  of  p  variables.  We  will  first 
discuss  the  Gaussian  case,  and  show  why  it  is  necessary  to  restrict  attention  to  an  additive 
model. 

We  assume 

Y  =  r)(X)  +  e  (30) 

where  r){x)  =  E  (F  |A  =  x),  Var(Y  \x)  =  a2,  and  the  errors  c  are  independent  of  X.  The 
goal  is  to  estimate  f?(x).  If  we  use  the  least  squares  criterion  E (Y  -  r?(X))2,  the  best  choice 
for  rj(x)  is  E(F  \X  =  *).  In  the  case  of  a  single  covariate,  we  estimated  E(F  |X  =  x)  by  a 
scatterplot  smooth,  which  in  its  crudest  form  is  the  average  of  those  y,-  in  the  sample  for  which 
Xi  is  close  to  x. 

We  could  think  of  doing  the  same  thing  for  multiple  covariates:  average  the  y,-  for  which 
Xi  is  close  to  x.  However,  it  is  well  known  that  smoothers  break  down  in  higher  dimensions 
(Friedman  and  Stuetzle,  1981);  the  curse  of  dimensionality  takes  its  toll.  The  variance  of 
an  estimate  depends  on  the  number  of  points  in  the  neighbourhood.  However,  you  have  to 
look  further  for  near  neighbours  in  high  dimensions,  and  consequently  the  estimate  is  no 
longer  local  and  can  be  severely  biassed.  This  is  the  chief  motivation  for  the  additive  model, 
rl(x)  —  s0  +  Ylj—i  sj{xj)-  Each  function  is  estimated  by  smoothing  on  a  single  co-ordinate;  we 
can  thus  include  sufficient  points  in  the  neighbourhoods  to  keep  the  variance  of  the  estimates 
down  and  yet  remain  local  in  each  co-ordinate.  Of  course,  the  additive  model  itself  may  be  a 
biassed  estimate  of  the  true  regression  surface,  but  hopefully  this  bias  is  much  lower  than  that 
produced  by  high  dimensional  smoothers.  The  additive  model  is  an  obvious  generalization 
of  the  standard  linear  model,  and  it  allows  easier  interpretations  of  the  contributions  of  each 
variable.  In  practice  a  mixture  of  the  two  will  often  be  used: 

i  p 

T]{x)  =  s0  +  J2  «;(*;)  +  J2  frxi-  (31) 

j=q+i 

In  later  sections  we  will  discuss  other  models  more  general  than  the  additive  model. 

7.1.  Estimation  —  The  Additive  regression  model. 

We  now  turn  to  the  estimation  of  s0,  si(-)> . . . ,  sp(-)  in  the  additive  regression  model 

p 

E(F  |x)  =  so  +  ^syfo),  (32) 

j'=i 


where  sq  is  a  constant  and  E  Sj(Xj)  =  0  V  j . 
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In  order  to  motivate  the  algorithm,  suppose  the  model  Y  =  s0  +  Y7j=\ »y(*y)  +  e  is  in 
fact  correct,  and  assume  we  know  «0,  si(0>  •  •  • »  «/-i(-),«y+i(-),  •  •  • »  sp(0-  If  we  define  the  partial 
residual: 

=  Y  ~  80  ~  y  ]  SfciXj, ;), 

then  E  (i?y  \  xy)  =  «y(xy)  and  minimizes  E(T  —  a0  —  ak{Xk))2.  Of  course  we  don’t  know 

the  s*(-)’s,  but  this  provides  a  way  for  estimating  each  «y(>)  given  estimates  !,  (•),  j  j .  The 
resulting  iterative  procedure  is  known  as  the  backfitting  algorithm  (Friedman  and  Stuetzle, 
1981): 

Backfitting  algorithm 

Initialization:  s0  =  E(y),  =  s§  =  . . .  =  8°  =  0,  m  =  0. 

Iterate:  m  <—  m  +  1 

for  j  =  1  to  p  do: 

Rj  =  r~. o  -  Ei;1!  >T(Xt)  -  EJ=)+1  »r 1  (*)• 

>?(*,)  =  E(K,|ly). 

Until:  RSS  =  E(V  —  so  —  X^y=i  Sy'C-Xy))2  fails  to  decrease. 

In  the  above  s'?1  denotes  the  estimate  of  s,  at  the  mth  iteration.  Notice  that  by  effectively 
centering  Y  at  the  start,  we  guarantee  that  Es™(X,)  =  0  at  every  stage.  It  is  clear  that 
RSS  does  not  increase  at  any  step  of  the  algorithm,  and  therefore  converges.  Breiman  and 
Friedman  (1982,  Theorem  5.19)  show  in  the  more  general  context  of  the  ACE  (Alternating 
Conditional  Expectation)  algorithm  that  the  solution  ]T)  Sy°(*y)  is  unique  and  is  therefore  the 
best  additive  approximation  to  E(y  \x).  This  does  not  mean  that  the  individual  functions 
are  unique,  since  dependence  amongst  the  covariates  can  lead  to  more  than  one  representation 
for  the  same  fitted  surface.  These  results  do  not  depend  on  the  validity  of  either  the  additive 
model  for  E  (Y  \  x)  or  the  additive  error  assumption  as  in  (30)  . 

If  we  return  to  the  world  of  finite  samples,  we  replace  the  conditional  expectations  in  the 
backfitting  algorithm  by  their  estimates,  the  scatterplot  smooths.  Breiman  and  Friedman  have 
proved: 

•  For  a  restrictive  (impractical)  class  of  smoothers,  the  algorithm  converges. 

•  For  a  less  restrictive  class,  the  procedure  is  mean  square  consistent  in  a  special  sense. 
Suppose  that  the  mth  iteration  estimate  of  sy  is  1™,  where  the  hat  implies  it  is  a  function 
of  the  sample  size  n.  Let  sj1  be  the  estimate  of  sy  at  the  mth  iteration  of  the  algorithm 
applied  to  the  distributions.  Then  E(IJl(A)  -  sf(X))2  ->0asn-*oo. 

A  special  case  arises  if  we  use  the  least  squares  regression  a  +  Sxy  to  estimate  E(-  |xy)  at 
every  stage  of  the  algorithm.  We  can  once  again  invoke  the  Breiman  and  Friedman  results  for 
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this  projection  operator,  which  show  that  the  algorithm  converges  to  the  usual  least  squares 
estimate  of  the  multiple  regression  of  Y  on  x.  This  is  true  for  both  the  usual  data  estimate  or 
the  estimate  in  distribution  space  as  in  Section  5.  We  give  an  elementary  proof  of  this  fact  in 
Appendix  A. 

Although  these  results  are  encouraging,  much  work  is  yet  to  be  done  to  investigate  the 
properties  of  additive  models.  In  multiple  regression  we  need  to  worry  about  collinearity  of 
covariates  when  interpreting  regression  coefficients;  perhaps  cocurvity  has  even  worse  impli¬ 
cations  when  trying  to  interpret  the  individual  functions  in  additive  models.  This  would  call 
for  non-parametric  analogues  of  linear  principal  components  analysis  —  a  standard  device  for 
determining  lower  dimensional  linear  manifolds  in  the  data.  Some  work  in  this  direction  has 
been  done  (Hastie,  1983b,  1984b,  Young  et  al.,  1978). 

If  the  purpose  of  our  analysis  is  prediction,  these  problems  are  less  important.  We  proceed 
in  an  exploratory  spirit,  and  hopefully  a  sound  bed  of  theory  will  develop  around  these  as  yet 
unanswered  questions. 

7.2.  Backfitting  in  the  Local  Scoring  Algorithm. 

For  multiple  covariates  the  Local  Scoring  update  (19)  is  given  by 

=  E 

and  in  exponential  family  case  (21)  is 

nH*)  =  E 

=  E  (Z  |  x) 

where  g{fz°)  -  r?°  and  Z  is  the  adjusted  dependent  variable.  For  the  reasons  described  in  the 
previous  section,  we  will  restrict  attention  to  an  additive  model: 

p 

T){x)  =  so  +  Y^Sj{*j) 

}=l 

We  see  that  (34)  is  of  the  same  form  as  equation  (32)  ,  with  Z  playing  the  role  of  Y.  Thus  to 
estimate  the  sy’s,  we  fit  an  additive  regression  model  to  Z,  treating  it  as  the  response  variable 
Y  in  (32)  .  The  sum  of  the  fitted  functions  is  t/®  of  the  next  iteration.  This  is  the  motivation 
for  the  generalized  backfitting  algorithm  which  we  give  for  the  exponential  family  case  as  in 
(34)  . 


■ 

dl 

- 

V°(x) - 

dip 

-  \x 

E 

Jfl  \x 
dip*  1*1 

(33) 


1°W  +  (5'-/)0l* 


(34) 
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Generalized  backfitting  algorithm 
Initialization:  s0  =  E  (Z),  s$  =  =  . . .  =  =  0,  m  =  0. 

Iterate:  m  <—  m  +  1 

1)  Form  the  adjusted  dependent  variable 


z  =  n™-1  +  (Y  -  u™-1)—!- 

where  f?”1-1  =  s0  +  £y=1  and  t?”1"1  = 

2)  Form  the  weights  W  =  )2Vr~1. 

3)  Fit  an  additive  model  to  Z  using  the  backfitting  algorithm  with  weights 
W ,  to  get  estimated  functions  s™  and  model  f?m. 

Until:  D  =  E  Dev(K,  /im)  fails  to  decrease. 


Step  3  of  the  algorithm  is  simply  the  additive  regression  backfitting  algorithm  with  weights. 
In  Appendix  B  we  show  why  weights  are  required  even  in  the  distribution  version  of  the 
algorithm.  To  incorporate  them,  the  data  is  first  transformed  using  the  weights,  and  the 
backfitting  algorithm  is  then  applied  to  the  transformed  data.  From  the  results  of  the  previous 
section,  we  see  that  the  inner  loop  converges.  In  particular,  if  each  smooth  was  replaced  by 
the  simple  regression  on  the  corresponding  covariate  (for  data  or  distributions),  the  backfitting 
algorithm  converges  to  the  usual  (weighted)  multiple  regression.  This  shows  that  in  this  case, 
the  algorithm  is  identical  to  the  usual  GLM  estimation  procedure  using  Fisher  scoring  as  in 
(9)  and  (10).  Once  again  the  data  analogue  of  the  algorithm  replaces  weighted  conditional 
expectations  by  weighted  smooths. 

The  backfitting  idea  is  also  used  in  the  local  likelihood  estimation  procedure  to  incor¬ 
porate  multiple  covariates.  To  estimate  a  new  sy,  or  adjust  sy  for  other  sk  in  the  model, 
Sj  is  re-estimated  holding  all  others  fixed.  The  algorithm  cycles  through  the  functions  until 
convergence.  The  details  can  be  found  in  Tibshirani  (1984). 

7.3.  The  breast  cancer  example  continued. 

We  continue  our  analysis  of  the  breast  cancer  data  using  all  3  covariates.  The  model  is  now 
l— p(i)  This  is  preferable  to  modelling  p(X)  by  an  additive  sum,  since 

we  would  have  to  check  that  the  estimated  probabilities  are  positive  and  add  to  1;  the  logit 
transform  allows  our  estimates  to  be  unrestricted.  There  are  other  reasons  for  using  the  logit 
transform;  on  the  logit  scale  prior  probabilities  appear  only  as  an  additive  constant  (McCullagh 
and  Nelder,  1983).  This  is  useful  in  biomedical  problems  where  there  is  often  some  established 

population  risk,  and  the  problem  is  to  see  what  factors  modify  this  risk  for  the  sample  under 
study. 
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Table  2  summarizes  the  various  models  fitted.  The  column  labelled  dof  refers  to  the 
approximate  degrees  of  freedom  or  number  of  parameters  of  the  model.  The  derivation  of  these 
quantities  is  outlined  in  Section  10.  .Auto  in  the  column  labelled  spans  indicates  that  each  time 
a  smooth  was  computed,  the  span  was  selected  by  cross-validation.  The  entry  D 2  refers  to  the 
percentage  of  deviance  explained,  and  is  in  direct  analogy  to  the  more  familiar  B?  in  regression. 
Figures  3,  4,  and  5  show  the  estimated  functions  for  our  model  with  deviance  308.22  and  dof 
=  8.8. 

Landwehr  et  al  (1984)  analysed  this  data  set  and  in  particular  considered  partial  residual 
plots  in  order  to  identify  the  fundamental  form  in  which  terms  should  appear.  Their  final 
model  was 


logit  p(x)  =  fo  +  Xlfix  +  +  x\fo  +  x2/34  +  Xlx2p5  +  (Iog(l  +  *3))/?6  (35) 

with  a  deviance  of  302.3  on  299  dof.  We  fit  this  model  using  GLIM,  and  then  using  the 
backfitting  procedure  with  linear  fits  for  the  transformed  variables.  As  expected,  the  results 
agreed  up  to  4  significant  figures,  which  provides  an  empirical  proof  of  the  result  proved  in 
Appendix  A.  This  model  is  labelled  parametric  in  the  table.  We  have  superimposed  their 
parametric  model  terms  in  the  figures,  and  note  that  the  functions  are  very  similar.  If  x'-b  is 
the  estimated  linear  model,  and  p\  the  corresponding  probability  estimate,  the  partial  residual 
for  variable  j  and  observation  i  is  defined  by 


r{xij)  =  biXij  + 


y»  -  p\ 

Pit1  -P|)' 


(36) 


Landwehr  et  al.  show  that  if  the  true  model  is  logit  p(x)  =  /?o  +  PkXk  +  sy(zy),  then 

E[r(^j)  \Xj  =  a]  «  Sj(a).  They  then  use  the  smooth  of  the  partial  residuals  to  suggest  the 
functional  form.  This  result  breaks  down  if  the  other  terms  are  not  linear  (Hastie,  1984a  and 
Gong,  1984).  One  can  see  from  the  previous  section  that  smoothing  the  partial  residual  corre¬ 
sponds  to  the  first  step  of  the  generalized  backfitting  procedure  in  the  local  scoring  algorithm, 
if  our  starting  guess  is  the  linear  model.  The  backfitting  procedure  continues,  however,  by 
simultaneously  estimating  and  adjusting  non-parametric  functions  for  all  the  covariates. 

8.  Comparison  of  Local  Scoring  to  Local  Likelihood  Estima¬ 
tion. 


In  a  number  of  examples  that  we  have  tried,  the  Local  Scoring  and  Local  Likelihood  procedures 
give  very  similar  results.  This  is  not  surprising  in  light  of  the  discussion  of  Section  5,  where 
both  techniques  are  viewed  as  empirical  estimates  of  E(logi).  The  difference  seems  to  be  in 
computational  speed:  local  scoring  is  0(n )  while  local  likelihood,  if  the  span  increases  like  nc, 
is  0(nc+1).  For  large  data  sets,  the  local  scoring  procedure  is  considerably  faster.  This  leads  us 
to  ask:  will  the  two  procedures  always  give  similar  estimates?  Artificially,  they  could  be  made 
very  different.  The  reason  for  this  is  as  follows.  For  a  single  covariate,  the  local  likelihood 
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Table  2.  The  Analysis  of  Deviance  (ANODEV)  table  for  the  breast 
cancer  data. 


Model 

Constant 
*1)  ?2  k  *3 

Xi,  X2  k  X3 
X±,  X2  k  x$ 
X2  k  i3 

X%  k  X3 

x\  St  X2 
Parametric 


Spans 

1 

all  linear 
all  .5 


Deviance 

353.67 
328.75 
307.89 
308.22 
317.66 

312.68 
346.71 
302.30 


OOil>'0boooOOo 


Figure  3,  The  circles  represent  i(age),  where  the  area  of  the  circles 
is  proportional  to  the  number  of  points.  The  dashed  term  is  the  cubic 
polynomial  term  in  (35) 


Figure  4.  The  circles  represent  $(  year  of  operation).  The  dashed 
term  is  the  linear  term  in  (35)  . 


Figure  5.  The  circles  represent  s(#  of  positive  axillary  nodes). 
The  dashed  term  is  the  log  term  in  (35)  . 


21 


procedure  is  completely  local;  that  is  if  xy  is  not  in  the  neighbourhood  for  estimating  r?(x,), 
then  (xj,y})  has  absolutely  no  effect  on  the  estimate  This  is  not  true  in  the  local 

scoring  procedure,  for  as  the  smooth  operation  is  iterated,  the  estimates  r)(xy)  enter  into  the 
computation  of  p(x,).  Thus  sending  yj  off  to  +oo  would  have  a  large  effect  on  the  estimate  of 
17 (x,)  in  the  smooth  updating  procedure,  but  no  effect  in  the  local  likelihood  procedure. 

Given  the  theoretical  basis  of  Section  5,  it  seems  eminently  reasonable  that  the  two 
procedures  be  asymptotically  equivalent.  We  sketch  a  proof  of  this  fact  in  Appendix  C. 

For  finite  samples  we  can  describe  operationally  the  difference  as  follows,  using  logistic 
regression  as  an  example.  Suppose  we  start  with  p°(xt)  =  y,  the  overall  proportion  of  l’s. 
Then  the  first  iteration  for  both  procedures  is  identical: 

•  Local  scoring  fits  the  weighted  least  squares  regression  of  zy  =  logit  p°  +  against 

Xj  for  j  €  to  obtain  the  estimate  17 1  (*,•);  this  is  the  local  linear  smoother  operation  in 
this  neighbourhood. 

•  Local  likelihood  does  exactly  the  same  operation  in  computing  the  MLE  in  the  neigh¬ 
bourhood,  since  this  is  the  first  step  in  the  adjusted  variable  regression  procedure  used 
to  compute  the  MLE. 

The  second  iterations  are  very  similar: 

•  Local  scoring  regresses  Zj  =  ^(xy)  +  against  xy  for  j  €  N{  to  obtain  the 

estimate  p2(x,). 

•  Local  likelihood,  however,  regresses  zy  =  17/ (xy)  +  against  xy,  where  rjj(xj) 

refers  to  the  extrapolated  value  of  rjl  at  xy  derived  from  the  linear  estimate  rj1(xj)  = 
Aw  +  Pi  ixj- 

If  the  function  is  fairly  linear  in  the  neighbourhood  then  these  two  steps  will  yield  similar 
estimates.  For  a  given  point  xy,  the  local  scoring  algorithm  uses  its  latest  estimate  of  p(xy) 
for  every  neighbourhood  in  which  xy  appears.  The  local  likelihood  procedure,  however,  uses  a 
linear  approximation  (on  the  r)  scale)  for  p(xy)  based  on  its  estimate  p(x,)  when  xy  is  in  Ay. 

9.  Asymptotic  Properties. 

Since  the  local  scoring  and  local  likelihood  procedures  use  local  maximum  likelihood  estimates, 
we  would  expect  them  to  have  reasonable  asymptotic  properties.  Tibshirani  (1984)  extends 
the  work  of  McCullagh  (1983)  to  establish  such  properties  for  local  likelihood  estimates  in 
the  exponential  family.  Consider  estimation  of  a  single  smooth  p(*)  at  a  fixed  point  xo-  Let 
kn  be  the  number  of  points  in  the  neighbourhood  Nfi  used  to  estimate  r?(x0).  Assume  that 
kn  ->  00,  but  the  neighbourhood  shrinks  in  such  a  way  that  max{l-yeArn}  \Xi  -  xy|  =  o(/£1/2). 
Then  under  smoothness  constraints  on  r?(-),  regularity  conditions  on  the  distribution  of  Y  and 
boundedness  on  the  covariate  values,  Tibshirani  shows  that  the  local  likelihood  estimate  rj(x0) 
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is  consistent  for  the  true  value  t]{x o),  and  has  the  efficiency  of  a  maximum  likelihood  estimator 
based  on  kn  (instead  of  n)  observations.  The  proofs  of  these  results  rest  on  the  fact  that  the 
local  likelihood  estimate  of  E(log  L  |a:o)  (see  Section  5)  is  consistent.  The  results  then  follow 
by  Taylor  series  expansions. 

Similar  results  should  be  obtainable  for  the  local  scoring  algorithm.  They  could  be  de¬ 
rived  from  the  asymptotic  equivalence  with  Local  Likelihood  estimation,  or  directly  as  follows. 
Consider  the  stable  point  of  the  update  step  (20).  Assuming  the  neighbourhoods  behave  as 
above,  Smooth  [p°(x)]  «  r)°(x)  for  large  n,  so  at  convergence  Smooth  [^]  «  0.  Under  condi¬ 
tions  such  that  Smooth  [•]  at  xq  is  consistent  for  E(-  |xo),  the  asymptotic  properties  of  f){x o) 
will  follow  from  standard  Taylor  series  arguments. 

10.  Deviance  and  Degrees  of  Freedom. 

In  generalized  linear  models,  the  goodness  of  fit  of  an  estimate  ft  is  measured  by  the  deviance. 
Wald  s  theorem  tell  us  that,  given  two  nested  linear  models  and  the  hypothesis  that  the  smaller 
model  is  correct,  the  deviance  decrease  in  fitting  the  larger  model  is  asymptotically  c2\p  - p  , 
where  px  and  p2  are  the  ranks  of  the  two  linear  spaces.  That  is,  the  number  of  parameters  fit 
give  the  number  of  degrees  of  freedom  of  the  corresponding  deviance  decrease. 

This  leads  us  to  ask  similar  questions  for  the  smooth  estimates  described  in  this  paper. 
We  will  restrict  our  discussion  to  the  exponential  family  case,  and  also  to  the  case  of  known 
variance  a2  =  1.  The  question  of  interest  is:  how  many  “parameters”  does  a  smooth  employ? 
This  will  depend  on  the  span.  With  a  span  of  2.0  (i.e.  every  neighbourhood  contains  all  the 
data  points),  2  parameters  are  used.  With  a  span  of  0  (i.e.  1  point  per  neighbourhood),  n 
independent  parameters  are  used.  Thus  for  the  usual  spans  (.3  to  .7),  the  number  of  parameters 
should  be  somewhere  between  2  and  n. 

For  the  local  likelihood  procedure,  Tibshirani(l984)  provides  a  definition  of  “number  of 
parameters”  or  “degrees  of  freedom”  and  an  approximate  method  for  determining  it.  This 
follows  work  by  Cleveland  (1979)  on  degrees  of  freedom  for  scatterplot  smoothers. 

Consider  first  a  multiple  linear  regression  model  with  variance  equal  to  1.  Let  yx  and  y2 
be  the  fitted  values  for  two  nested  models,  and  assume  that  the  sub- model,  say  model  1,  is 
correct.  Then  the  decrease  in  residual  sum  of  squares  has  expected  value  p2  -  Pi-  One  way 
to  derive  this  is  to  write  yx  =  Hxy  and  y2  =  H2y,  where  Hx  and  H2  are  the  corresponding 
hat  matrices.  Then  a  simple  calculation  shows  that  the  expected  decrease  in  residual  sum  of 
squares  is  just  trace{H2)  -  trace{Hx )  =  p2  -  px. 

For  a  running  lines  smoother,  an  analogous  result  can  be  derived.  Consider  a  single 
covariate  x.  First,  we  note  that  the  output  of  the  smoother  can  be  written  as  y  =  Sy  where 
5  is  a  “smoother  matrix”.  Now  consider  two  fit  vectors  yx  and  y2  obtained  by  smoothing  y 
on  x  with  different  spans.  By  analogy  to  the  multiple  linear  regression  case,  we  can  define  the 
difference  in  degrees  of  freedom  of  the  fits  by  the  expected  value  of  RSS(y,y2 )  -  RSS( y,yx). 
In  the  regression  set-up,  this  expected  value  was  computed  under  the  assumption  that  the 
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smaller  model  was  correct.  Here  we  make  the  analogous  assumption  that  Eyi  «  Ey2,  i.e. 
that  the  two  fits  are  about  the  same  on  the  average.  Then  under  this  assumption,  it  is  easy 
to  show  that  E(RSS(y,y2)  —  RSS(y,y i))  =  trace(S2 )  —  trace(Si).  Hence  we  can  think  of 
trace(S)  as  the  number  of  degrees  of  freedom  of  the  smoother  based  on  S. 

Further,  the  same  result  is  approximately  true  for  any  local  likelihood  fit  in  the  exponential 
family.  Consider-two  fits  fii  and  £2  based  on  a  covariate  vector  *  and  spans  /1  and  /2.  Then 
the  difference  in  degrees  of  freedom  of  the  two  smooths  is  defined  to  be  E(  Dev(/i1,y)  - 
Dev(/i2,  y))  under  the  assumption  that  E /i±  fa  E/ia.  Let  Si  and  S2  be  the  smoother  matrices, 
based  on  *  that  produce  running  lines  smooths  of  spans  li  and  /2  respectively.  Then  it  can  be 
shown  that  E(  Dev(/ii,y)  —  Dev(/i2, y))  «  trace(S2)  -  trace{Si). 

Given  a  local  likelihood  fit,  we  easily  work  out  trace(S)  and  use  it  to  determine  the 
significance  of  the  smooth.  As  a  example,  for  200  equally  spaced  X  values  and  a  span  of 
.5,  trace(S)  is  about  3.6.  Hence  the  smooth  uses  roughly  3.6  parameters.  We  say  “roughly” 
because  the  distribution  of  this  decrease  is  not  x2,  however,  but  is  more  spread  out.  The 
trace(S)  formula  should  only  be  used  as  a  rule  of  thumb. 

Since  the  local  scoring  procedure  produces  estimates  similar  to  local  likelihood  estimates, 
the  same  result  should  approximately  be  true  for  it  as  well.  Due  to  the  complex  nature  of  the 
estimation  procedure,  however,  we  have  been  unable  to  verify  this  analytically.  Instead,  we 
describe  a  small  simulation  study  designed  to  check  the  result  numerically. 

10.1.  Degrees  of  Freedom  Simulations. 


Table  3.  Results  of  Degrees  of  Freedom  Simulation.  Entries  in 
Lines  (2) — (7)  are  mean  (variance)  of  deviance  decrease.  LS  and  LL 
mean  local  scoring  and  local  likelihood  respectively. 


Span 

Source  _ _ _ .3  .4  .5  .6 

(1)  Trace(S)-l  4.09 


(2)  Scatterplot  Smooth(y  normal)  4.14(10.00) 

(3)  Scatterplot  Smooth(y  uniform)  4.19(10.06) 

(4)  Logistic  Model  (LS)  (constant  vs  smooth)  4.35(13.86) 

(5)  Logistic  Model  (LS)  (linear  vs  smooth)  3.31(9.54) 

(6)  Logistic  Model  (LL)  (constant  vs  smooth)  4.34(13.47) 

(7)  Logistic  Model  (LL)  (linear  vs  smooth)  3.29(11.71) 


3.32  2.65  2.34  2.16 

3.39(7.75)  2.61(6.03)  2.31(5.08)  2.09(4.32) 

3.46(8.50)  2.77(6.52)  2.41(5.79)  2.21(4.99) 

3.39(11.78)  2.67(9.16)  2.35(8.02)  2.25(6.50) 

2.36(8.27)  1.61(5.12)  1.33(4.12)  1.21(3.66) 

3.40(11.62)  2.72(9.12)  2.28(7.51)  2.17(6.28) 

2.25(8.25)  1.63(6.21)  1.29(4.58)  1.12(2.89) 


Table  3  shows  the  results  of  a  modest  simulation  study  designed  to  check  the  accuracy  of 
the  formula  E(D(y,yi)  —  D(y,  y2))  =  trace^)  -  trace(Si).  The  numbers  in  the  table  were 
obtained  as  follows.  100  x  values  were  generated  from  .V(0, 1)  and  fixed  for  the  entire  table. 
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Given  these  x  values,  we  constructed  the  running  lines  smoother  matrices  for  the  indicated 
spans,  and  the  trace  of  each  matrix  (minus  1)  is  shown  in  line  (1). 

Consider  for  example  the  entry  4.09  in  the  top  left  hand  corner.  According  to  the  discus¬ 
sion  of  the  preceding  section,  this  should  be  the  expected  decrease  in  deviance  due  to  fitting  a 
local  likelihood  or  scoring  model  with  that  span  .3  versus  a  model  with  only  a  constant. 

To  obtain  line  (2),  we  generated  100  y,’s  from  a  A/ (0, 1)  distribution  and  computed 
RSS(y,y  1)  —  RSS(y,y),  y  being  the  fit  from  a  scatterplot  smoother  (y  =  Py)  with  span 
as  shown.  Line(2)  shows  the  mean  and  variance  from  500  such  repetitions  of  this  process. 

Line  (3)  was  obtained  in  same  way  as  line  (2),  except  that  the  y.’s  were  generated  from 
uniform  (— \/3,  \/3),  the  range  chosen  so  that  Var(yt)  =  1. 

To  obtain  line(4),  we  generated  100  y,’s  from  binomial (1, 1/2)  and  fit  a  local  scoring 
logistic  model  with  spans  of  .3  to  .7.  The  numbers  show  the  mean  and  variance  of  D(y,y  1)  — 
D{y,y)  over  500  repetitions. 

Line  (5)  was  generated  in  a  similar  fashion  as  line  (4),  showing  instead  the  mean  and 
variance  of  D(y,yt)  -  D(y,y),  yt  being  the  linear  logistic  fit,  with  y,  generated  from  a  linear 
logistic  model,  P(y,-  =  1  |a:)  =  e2x/(l  +  e2x). 

Lines  (6)  and  (7)  are  the  same  as  (4)  and  (5)  except  that  the  smooths  were  obtained  by 
local  likelihood  estimation. 

Note  that  for  all  the  models,  a  span  of  2.0  gives  either  exactly  or  asymptotically  a  mean 
value  of  1  (by  Wald’s  theorem),  and  trace(S)  -  1  is  also  equal  to  1. 

The  results  give  fairly  strong  support  to  the  approximation  E{D(y,jji)  -  D(y,y2))  = 
trace(Si)  —  trace(Si).  Lines  (2)  and  (3)  agree  well  with  (1),  not  surprising  since  the  approx¬ 
imation  is  exact  for  scatterplot  smoothers.  Line  (4)  also  is  in  good  agreement,  with  a  small 
upward  bias  for  smaller  spans.  Line  (5)  should  be  1  less  than  line  (1),  (since  the  global  linear 
fit  uses  2  degrees  of  freedom)  and  the  results  indicate  that.  As  we  expected,  the  local  likelihood 
results  are  very  similar  to  the  local  scoring  numbers. 

The  variance  results  are  a  little  unsettling.  The  variance  to  mean  ratio  is  often  greater 
than  2  (the  ratio  for  a  chi-square  variate),  especially  for  the  non-Gaussian  models. 

We  conclude  from  these  simulations  that  the  approximation  E(Z)(y,y1)  -  D(y,y2))  = 
trace(S2)  -  trace(S i)  is  satisfactory  as  a  rough  rule  of  thumb,  for  the  Gaussian  and  logistic 
models.  We  do  note,  however,  that  the  distribution  of  this  decrease  is  more  spread  out  than 
a  chi-square  variate  with  the  corresponding  degrees  of  freedom,  so  that  tests  based  on  the 
percentile  points  will  be  too  liberal. 

The  numbers  reported  here  for  local  likelihood  estimation  can  also  be  found  in  Tibshi- 
rani(1984).  In  that  study,  the  Cox  model  was  also  included  in  the  simulation,  and  the  trace 
formula  was  found  to  be  biassed  downward.  Thus  to  accurately  assess  the  significance  of  a 
Cox  smooth  in  real  examples,  the  mean  decrease  must  be  found  by  simulation.  This  was  done 
for  the  example  of  Section  6. 
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11.  Example:  Ozone  Concentration  data. 

In  this  section,  we  apply  the  local  scoring  procedure  to  some  data  on  atmospheric  ozone 
concentration,  given  in  Breiman  and  Friedman  (1982).  The  data  consist  of  330  observations 
on  10  variables: 

Response: 

UP03:  Upland  Ozone  Concentration  (ppm) 

Predictors: 

VDHT:  Vandenburg  500  millibar  height  (m) 

HMDT:  Humidity  (percent) 

IBTP:  inversion  Base  Temperature  (F°) 

SBTP:  Sandburg  Air  Force  Base  Temperature  ( C° ) 

IBHT  Inversion  Base  Height  (feet) 

WDSP:  Wind  Speed  (mph) 

DGPG:  Daggot  Pressure  Gradient  (mmhg) 

VSTY:  Visibility  (miles) 

DOYR:  Day  of  Year 


The  objective  is  to  study  the  effect  of  various  meterological  variables  on  atmospheric 
ozone  concentration.  Following  Breiman  and  Friedman,  we  considered  all  the  covariates  except 
DOYR  initially,  then  examined  the  effect  of  adding  DOYR  to  our  model. 

We  used  the  Gaussian  additive  model  to  explore  these  data.  A  summary  of  the  models  is 
given  in  Table  4. 


Table  4.  The  ANOVA  table  for  the  Ozone  Concentration  Data. 


Model  .  Spans  dof  Deviance  (RSS)  R2 


Constant 

1 

21115.41 

First  8  predictors 

all  linear 

9 

6539.00 

.69 

First  8  predictors 

all  .5 

22  .5 

5176.56 

.75 

All  9  predictors 

auto 

21  .8 

4292.28 

.80 

SBTP,  IBTH,  DGPG,  VSTY 

auto 

11  .0 

5431.93 

.74 

SBTP,  IBTH,  DGPG,  VSTY, 
DOYR 

auto 

12  .4 

4736.60 

.78 

Semi-parametric 

11  .2 

4848.99 

.77 

26 


SBTP 


Figure  6.  The  estimated  function  for  Sandburg  Air  Force  Base 
Temperature.  The  area  of  the  circles  is  proportional  to  the  number  of 
data  points  occurring  at  that  value  of  SBTP. 
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Figure  8.  The  estimated  function  for  Daggot  Pressure  Gradient. 


0  100  200  300  400 

VSTY 


Figure  9.  The  estimated  function  for  Visibility. 
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s(DOYR) 


DOYR 

Figure  10.  The  estimated  function  for  Day  of  the  Year. 

As  a  pre-screening  step,  we  examined  the  effect  of  dropping  each  of  the  8  covariates  from 
the  full  model.  For  this  phase,  a  fixed  span  of  .5  was  used.  The  full  model  had  a  residual  sum 
of  squares  of  5176.56  on  22.5  degrees  of  freedom.  Based  on  this,  the  estimate  of  residual  error 
is  16.85.  This  compares  to  a  standard  multiple  linear  regression  fit  having  a  residual  sum  of 
squares  of  6539.00  on  9  degrees  of  freedom.  Using  F-values  as  a  rough  guide,  the  variables 
SBTP,  IBHT,  DGPG  and  VSTY  were  seen  to  cause  a  significant  increase  in  the  residual  sum 
of  squares,  and  the  remaining  predictors  were  dropped. 

Using  these  selected  predictors,  an  additive  model  was  fit,  this  time  allowing  the  procedure 
to  choose  the  optimal  spans  by  cross-validation.  The  resultant  model  had  a  residual  sum  of 
squares  of  5481.93  on  8.8  degrees  of  freedom,  and  is  not  significantly  worse  than  the  full  model. 
We  then  added  the  variable  DOYR  to  the  model,  and  it  was  highly  significant,  decreasing  the 
residual  sum  of  squares  by  over  700,  with  12.4-11.0=1.4  additional  degrees  of  freedom.  Note 
that  instead  of  using  DOYR’s  degrees  of  freedom  (which  was  1.9),  we  use  the  difference  in 
degrees  of  freedom  in  the  two  models.  The  reason  is  that  when  a  variable  is  added  to  a  model, 
the  spans  chosen  for  other  variables  can  change  and  hence  their  degrees  of  freedom  change 
also. 

Dropping  any  of  these  variables  caused  a  large  increase  in  the  residual  sum  of  squares. 
The  fit  of  the  full  model  (all  9  predictors)  is  also  shown  in  Table  4. 

At  this  point  we  mention  a  complexity  caused  by  the  varying  spans.  Although  it  seems 
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that  the  4  variable  model  is  ‘nested”  within  the  8  variable  model,  there  is  no  guarantee  that 
its  residual  sum  of  squares  will  be  higher.  This  is  because  the  spans  aren’t  chosen  to  minimize 
the  overall  residual  sum  of  squares.  It  is  quite  possible  that  in  dropping  a  useless  predictor, 
the  residual  sum  of  squares  will  decrease!  This  is  not  a  practical  problem,  but  it  would  create 
some  unusual  twists  in  a  testing  theory  for  additive  models. 

Figures  6  —  10  show  the  estimated  smooths  for  the  5  predictors.  All  the  variables  seem 
to  display  non-linear  effects.  To  simplify  the  model,  we  forced  in  a  parametric  form  for  each 
of  the  variables,  one  at  a  time.  A  linear  term  was  tried  for  SBTP  and  VSTY,  and  quadratic 
terms  for  IBHT,  DGPG  and  DOYR.  Both  linear  terms  proved  to  be  inadequate,  especially  for 
SBTP.  Its  linear  term  caused  the  residual  sum  of  squares  to  increase  to  5214.03,  with  about 
1  less  degree  of  freedom.  The  quadratic  terms  were  all  satisfactory.  Thus  we  have  our  final 
model: 

UPOZ  =  constant  +  sx{S  BTP)+axIBHT  +  a2IBHT 2  +  byDGPG  +  b2{DGPG )2 

+sz(vsty)  +  Cidoyr  +  C2{doyr )2  (37) 

with  a  residual  sum  of  squares  of  4858.99  on  11.2  degrees  of  freedom.  The  RSS  of  this  model 
is  112  higher  than  the  RSS  for  the  nonparametric  model,  but  for  descriptive  purposes  it  is 
adequate. 

Breiman  and  Friedman  fit  an  ACE  (Alternating  Conditional  Expectation)  model  to  these 
data.  The  ACE  model  is  the  same  as  the  additive  Gaussian  model,  except  that  they  also 
estimate  a  tranformation  for  the  response.  The  final  model  obtained  by  Breiman  and  Friedman, 
using  a  forward  stepwise  ACE  model  contains  the  same  predictors  as  the  above  model.  In 
addition,  the  estimated  transformation  obtained  from  ACE  was  only  slightly  non-linear,  and 
hence  the  estimated  smooths  from  ACE  are  very  similar  to  those  in  Figures  6  —  10. 

We  discuss  an  alternative  to  response  variable  transformation  and  apply  to  these  data  in 
Section  12. 

12.  Transformations  of  the  Additive  Model. 

For  exponential  families,  the  model  we  have  discussed  up  to  now  has  the  form  g{]i)  =  rj  — 
XTi  s;(^;')>  where  g(-)  is  the  (known)  link  function.  A  more  general  model  is 
9  if*)  =  fin)  -  fiiyiSj(Xj)),  where  /(•)  is  a  unspecified  smooth  function.  As  we  did  for  the 
sji‘)  we  will  show  how  to  estimate  /(■)  non-par ametrically.  For  ease  of  interpretation,  we 
will  restrict  /(•)  to  be  monotone.  Note  that  since  /(•)  is  arbitrary,  so  is  /-1(ff(-)),  and  we 
could  write  this  as  g  it]).  Hence  non-parametric  estimation  of  /(•)  provides  non-parametric 
estimation  of  the  link  function.  In  some  applications,  we  could  set  «?(•)  equal  to  the  identity 
function;  in  others,  we  might  want  to  start  with  g(-)  equal  to  the  natural  link  for  the  problem, 
and  see  if  the  estimated  /(•)  is  close  to  the  identity  function. 

Estimation  of  /(•)  can  be  achieved  through  a  modification  of  the  local  scoring  algorithm. 
The  new  procedure  consists  of  two  alternating  loops,  one  each  for  the  estimation  of  the  Sj(-)’s 
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and  /(•).  To  estimate  «(•)  with  /(•)  fixed,  we  use  formula  (20)  noting  that  the  derivatives  now 
involve  the  derivatives  of  /(•).  In  the  generalized  linear  model  case  this  reduces  to 


r,1(x)=  Smooth  [r?°(x)  +  (y  - 

with  weights  W~}  =  [(^)2//V)2F°-  To  estimate  /(•)  with  J(-)  fixed,  let  b  = 
the  update  is 


/1(6)  =  Smooth  f°(b)  — 
In  the  generalized  linear  models  case,  this  is 


dl 

drj° 


Smooth 


(38) 
s(z),  and 

(39) 


f\b)  =  Smooth  [/°(6)  +  (y  -  M °)~ jp]  (40) 

with  weights  W  1  =  (^r)2U°.  The  two  loops,  estimating  the  Sj(-)’s  and  estimating  /(•),  are 
alternated  until  convergence. 

For  non-exponential  family  models,  the  link  relation  would  be  of  the  form  g(9)  =  f(v), 

where  9  is  some  parameter  ofthe  likelihood.  For  example,  one  generalization  of  the  Cox  model 
would  be  A (t  \x )  =  Ao(t)e^^-'i 

The  algorithm  requires  two  special  subroutines.  First,  the  derivatives  of  /(•)  are  needed 
for  the  first  step—  Jerome  Friedman  kindly  supplied  us  with  his  procedure  for  estimating  the 
derivatives  of  a  smooth.  Secondly,  the  estimate  of  /(■)  must  be  monotone.  The  monotone 
smoothing  technique  described  in  Friedman  and  Tibshirani  (1984)  was  used  for  this  purpose. 

We  tried  this  procedure  on  two  examples.  The  first  data  set  concerns  the  strength  of 
yarns  and  is  taken  from  Box  and  Cox  (1964).  It  consists  of  a  3  x  3  x  3  experiment,  the  response 
being  number  of  cycles  to  failure  and  three  covariates:  length  of  test  specimen,  amplitude  of 
loading  cycle  and  load.  Box  and  Cox  fit  linear  terms  to  the  covariates  and  found  that  the  log 
transformation  was  ideal  for  the  response.  Note  that  if  Y  has  mean  n  and  variance  U(p),  then 
logy  has  mean  about  log/i  and  variance  about  U(m)//A  Hence  if  the  log  transformation  is 
appropriate  (both  for  additivity  of  effects  and  variance  stabilization)  then  a  generalized  linear 
model  with  T]  =  log (/i)  and  V(fji)  —  p2  should  also  be  appropriate. 

To  test  our  procedure,  then,  we  set  y(p)  =  p  and  V(n)  =  and  a  linear  term  was  fit  for 
each  covariate.  The  estimated  /(•)  is  shown  in  figure  11,  along  with  the  exponential  function. 
The  agreement  is  very  good.  Furthermore,  the  residual  sum  of  squares  of  the  final  model  was 
very  close  to  that  obtained  by  Box  and  Cox  for  the  log  transformation. 

As  a  second  example,  the  link  estimation  procedure  was  applied  to  the  ozone  concentration 
data.  Using  the  covariates  of  Table  4  and  the  spans  chosen  there,  the  estimated  /(•)  is  shown 
in  figure  12.  The  transformation  has  positive  curvature,  indicating  that  mild  transformation 
of  the  additive  predictor  may  improve  the  fit.  This  is  in  qualitative  agreement  with  Breiman 
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V 

Figure  11.  The  solid  function  is  the  estimated  transformation  /(•) 
found  for  the  Box-Cox  data.  The  dashed  function  is  the  exponential 
suggested  by  the  log-transform  of  the  response  variable  y. 

and  Friedman’s  analysis:  using  a  response  variable  transformation  method  (see  ACE,  next 
section),  they  find  that  a  transformation  with  slightly  negative  curvature  is  appropriate. 

Despite  these  encouraging  results,  the  link  estimation  procedure  is  still  experimental.  The 
reason  is  that  the  iterations  can  be  unstable  if  the  derivatives  of  /(•)  get  too  close  to  zero.  We 
are  presently  studying  this  problem. 

13.  Relationship  to  Other  Methods. 

As  we  have  seen,  the  Local  Scoring  technique  generalizes  standard  (linear)  likelihood  methods. 
When  each  neighbourhood  contains  all  the  data  points,  Local  Scoring  corresponds  exactly 
to  standard  (linear)  maximum  likelihood  estimation.  For  smaller  spans,  the  Local  Scoring 
procedure  can  detect  curvature  in  the  covariate  functions.  In  the  class  of  GLM’s,  the  linear 
predictor  is  generalized  to  a  sum  of  smooth  functions.  Note  also  that  this  technique  provides 
non-parametric  “quasi-likelihood”  estimation.  McCullagh  (1983)  notes  that  in  the  exponential 
family,  the  score  equation  involves  only  the  mean  and  variance  of  Y.  This  if  one  is  only  willing 
to  assume  a  mean-variance  relationship  for  Y,  it  might  be  reasonable  to  base  the  estimation 
procedure  on  the  corresponding  exponential  family  score  equation.  McCullagh  calls  this  “quasi- 
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Figure  12.  The  estimated  transformation  for  the  Ozone  Concen¬ 
tration  Data  of  Section  11. 

likelihood  estimation.  Since  local  scoring  (and  local  likelihood)  depend  only  on  the  mean  and 
variance  as  well,  they  can  be  thought  of  as  extensions  of  quasi-likelihood  modelling. 

The  models  described  in  this  paper  are  also  related  (when  the  error  distribution  is  Gaus¬ 
sian)  to  a  number  of  non-parametric  regression  models  that  have  been  recently  suggested. 
Friedman  and  Stuetzle  (1981)  introduced  the  projection  pursuit  regression  model: 

p 

Y  =  sj  (a}Xj  )  +  error  (41) 

l 

The  directions  a;  are  found  by  a  numerical  search,  while  the  «,•(•)'«  are  estimated  by  smoothers. 
Friedman  and  Stuetzle  call  the  special  case  of  co-ordinate  directions,  i.e.  the  model 

p 

Y  =  ^2  sj  ( Xj )  +  error  (42) 

l 

the  “projection  selection”  model.  This  corresponds  to  the  additive  Gaussian  model  described 
here,  and  the  algorithm  for  estimating  the  smooths  is  identical. 

The  ACE  (Alternating  Conditional  Expectation)  model  generalizes  the  additive  Gaussian 
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model  by  estimating  a  transformation  of  the  response: 

p 

8{Y)  =  +  error  (43) 

1 

Related  to  this  is  the  PACE  (Predictive  ACE)  model  of  Friedman  and  Owen  (1984) 

p 

Y  =  /(^a(X,))  +  error  (44) 

l 

This  model  transforms  the  mean  of  Y  instead  of  Y.  The  PACE  model  is  a  special  case  of  the 
the  additive  model  with  link  estimation  desribed  in  the  previous  section:  it  corresponds  to  the 
identity  link  (/i.  =  rj)  and  Gaussian  likelihood. 

Finally,  we  note  that  the  link  estimation  procedure  of  Section  12  can  be  used  to  further 
generalize  the  model  to  allow  covariate  projections  (as  in  PPR).  The  model  would  be 

p 

$(/*)  =  X^J'K*/)  (45) 

l 

A  single  term  <f>{a'z)  could  be  estimated  by  setting  /(•)  =  and  forcing  «, •(*,•)  to  be  linear 
(=  a,-z,').  Additional  terms  could  be  added  in  a  forward  stepwise  manner. 


Software 

A  fortran  program  that  computes  local  scoring  estimates  for  exponential  family  models  is 
available  from  either  author. 
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Appendix  A.  The  backfitting  algorithm  using  linear  fits. 

In  this  appendix  we  prove  that  the  fit  vector  from  the  backfitting  algorithm  converges  to  the 
least  squares  answer  if  global  linear  fits  are  used  to  estimate  E(-  |x).  Let  V  be  the  subspace 
spanned  by  z i,  *2>  •  •  • ,  with  orthogonal  complement  V-1-;  here  Xj  is  a  n  vector  of  observations 
on  variable  j.  Let  Pj  denote  linear  projections  onto  Xj.  For  any  vector  a  let  a  be  its  projection 
onto  V  and  ax  =  a  —  a.  The  residual  after  m  cycles  of  the  backfitting  algorithm  through  the 
p  predictors  is  given  by 

rm  =  Cmy 

where  (46) 

C  =  (I-Pp)(I-Pp_1...(I-P1)y. 

It  is  immediate  that  rm  =  Cmy  +  y1. 

Theorem  1 

||CmyJj  — *•  0,and  thus  rm  — >  y-1. 

Proof.  (Stuetzle,  1983) 

For  any  vector  a  we  use  the  natural  norm  for  matrices  to  get 

IlCaH  <  ||I  -  Pp||  ||(7  -  Pp_i) Pl)a|| 

<  ||(/  —  P p-i) ...  (7  —  Pi)o|| 

:  (47) 

<  li«ll 

since  ||7  —  Pj\\  =  1  V j.  Similarly 

\\Ca\\  =  ||o|| 

=>||(7-P1)a||  =  ||a|| 

=>  (7  —  Pi) a  =  a 
=>  a  G  xj; 

But  then  a  €  x\  ....  and  finally  a  G  xj.  Thus  |[Ca||  =  ||o||  =>•  a  eV1.  So  if  a  G  V  then 

||Ca||  <  ||a|| 

<(l-e)||a||. 
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where  0  <  e  <  1.  Also  Ca  G  V  for  a  E  V.  Hence 


||C'ma||  <  (1  —  e)  ||Cm-1a|j 
<  (1  —  e)m  ||a|| 

This  is  true  for  any  a  €  V .  Since  y  G  V,  the  theorem  is  proved. 


Appendix  B.  The  use  of  weights  in  the  backfitting  procedure. 


The  function  f{x)  that  minimizes  the  weighted  least  squares  problem  Eiu(A)(y  -  /(A))2  is 
simply  E(F  \x),  and  the  weights  play  no  role.  When  we  estimate  this  quantity  using  global 
or  local  straight  line  fits,  however,  the  Gauss-Markov  theorem  tells  us  that  we  can  estimate 
the  parameters  more  efficiently  by  using  weights. 

If  we  wish  to  minimize 


Etn(X)(F -£«,•(*, ))2  (48) 

and,  as  in  the  backfitting  procedure,  «i,...,«p_i  are  known,  the  situation  is  different.  We 
write  (48)  as 

E*p  E*! ,...JCP,Y  \x,  [(^»0'r»*)  ~  sp{Xp))2  |  ip]  (49) 

where  Rp(Y,X)  =  Y  -  Yjj?p  sj{xi)-  Minimizing  this  function  w.r.t.  sp  yields 


_  E[tn(X)Ep(y,X)]xp] 
P  E[tn(A)|xp] 


(50) 


Thus  even  in  the  distribution  case,  we  need  to  use  weights.  It  is  clear  that  the  weights  play  no 
role  in  the  distribution  case  if  they  depend  only  on  the  variable  on  which  we  are  conditioning. 

In  the  generalized  additive  model  situation,  each  iteration  of  the  scoring  procedure  cor¬ 
responds  to  a  weighted  least  squares  problem,  with  weights  as  specified  in  the  algorithm.  The 
weights,  of  course,  usually  depend  on  the  unknown  model  and  so  the  latest  estimates  are 
substituted. 


Appendix  C.  Asymptotic  equivalence  of  Local  Scoring  and  Lo¬ 
cal  Likelihood. 

In  this  appendix,  we  sketch  a  proof  that  in  the  exponential  family,  the  local  likelihood  estimate 
at  a  single  X  value  asymptotically  satisfies  the  local  scoring  update  equation  (19). 

We  assume  that  the  Yi  ’s  are  independent  with  density  /y(y;^,^)  =  exp{[yO  -  &(0)]/a(<£)  + 
c(y.^)}-  have  E Y  =  b'(6)  =  fi,  VarY  =  b  (0)  =  and  p.  is  linked  to  a  a  single 
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covariate  X  by  q  —  s(x)  —  g((i).  Consider  estimation  of  s(*)  at  a  fixed  point  xo,  with  a  sample 
size  of  kn  points  in  the  neighbourhood  Nq,  and  assume  that  kn  — ►  oo  but 

max{i,jeN^}  I Xi  -1,1  =  o{k~ 1/2).  (51) 

We  make  the  simplifying  assumption  that  the  x,-’s  are  equally  spaced,  hence  the  local  likelihood 
estimate  in  the  jth  neighbourhood  is  just  /iy  =  yj.  The  local  scoring  step  is  q1  =  Smooth  [q  + 
(y— with  weights  Assuming  that  g(fi)  and  b(0)  have  two  bounded  derivatives, 

we  can  expand  each  in  a  Taylor  Series,  and  using  (51)  ,  the  local  scoring  step  can  be  written 
as 

T)l  -  Ave0(»?°)  +  Ave0(yy  -  fij)  +  o(l)  =  0  (52) 

where  Ave0  represents  the  (unweighted)  mean  over  j  G  Nq.  Thus  it  is  sufficient  to  show  that 
N  ~  V)  an^  Wj  =  y(yj)  satisfy  Aveo(y,-  —  ftj)  =  o(l)  and  r?o  —  Aveo(*7,)  =  o(l),  respectively. 
Assuming  that  the  second  moment  of  Y  is  well-behaved  enough  to  allow  application  of  the 
weak  law,  each  of  these  follow  by  standard  Taylor  series  arguments. 
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